Generation of test traffic configuration based on real-world traffic

ABSTRACT

Some embodiments provide a method for generating a test traffic configuration for testing a first network. From a second network, the method receives a set of data streams representing data traffic observed in the second network. The method uses a machine learning engine to analyze the set of data streams in order to determine traffic patterns in the second network. The method generates the test traffic configuration for testing the first network by replicating the traffic patterns of the second network in the first network.

BACKGROUND

Networking is at the heart of application delivery. Modern applications are built with the requirement of anywhere, anytime, any device, and any scale. Most such applications work seamlessly in an ideal world, but their behavior may differ under varying conditions due to changes in infrastructure and diverse traffic patterns. In a world where the number of connected devices increases every day, a solid IT infrastructure is of paramount importance. With this rate of growth, it becomes essential to ensure the scalability and reliability of an advanced application network appliance (e.g., a load balancer, virtualized network function, or other device) that handles traffic for real-world applications.

As such, application deployments (including any load balancer devices that handle traffic for the application) are typically tested with a traffic testing tool before real-world deployment. Most modern traffic generators are capable of performing stress testing but lack the diversity of traffic generation seen in a real-world production environment. Characterizing traffic patterns in such a deployment is non-trivial, and understanding the ever-changing, non-deterministic nature of traffic is of paramount importance in making the application robust. Today's network traffic generators have configurable parameters that include initial traffic load, ramp-up, and ramp-down for a given set of connections/users. To emulate real-world network traffic patterns, these parameters must be experimented with and tweaked constantly, which is not only time consuming but also arduous to scale across multiple applications. As such, improved mechanisms for simulating real-world traffic patterns would be beneficial.

BRIEF SUMMARY

Some embodiments provide a method for generating a test traffic configuration for testing a first network (e.g., a test network hosting a new release or a testing deployment for an application) based on replicating traffic patterns of a second network (e.g., a network hosting a real-world production deployment of the application). Specifically, some embodiments receive a set of data streams representing data traffic observed in the second network, use a learning engine to analyze these data streams to determine the traffic patterns of the second network, and generate the test traffic configuration to replicate these determined traffic patterns. The test traffic configuration is provided to a test traffic generation system that tests the first network by generating data traffic replicating that observed in the second network.

The test traffic configuration is generated, in some embodiments, by a traffic profiler that includes a machine learning engine. The traffic profiler uses the machine learning engine to replicate the traffic patterns of the second network and provides the test traffic configuration to a test traffic controller, which operates either in the first network or in a network that connects to the first network. The test traffic controller configures a set of test traffic sources (again, either in the first network or in a network that connects to the first network) to generate the test traffic according to the test traffic configuration, thereby replicating the real-world conditions observed in the second network. These test traffic sources generate and send test traffic to the application deployment being tested, thereby enabling the application deployment to be tested with a simulation of real-world traffic. In different embodiments, the test traffic sources send this traffic either directly to the application or to a device or set of devices (e.g., load balancer(s), gateway(s), or other devices that handle traffic from outside sources directed to the application).

In some embodiments, the traffic profiler receives the set of data streams representing data traffic observed in the second network from one or more devices in the second network, such as a load balancer, gateway, etc., that processes data traffic from outside devices (e.g., client devices such as laptop/desktop computers, mobile devices, etc.) that connect to the application in the second network. These data streams may arrive in different formats from the second network (e.g., sFlow, IPFIX, PCAP files, NetFlow, etc.) and may include different types of data (e.g., requests per second, connections per second, cookies, source and destination addresses for specific packets/connections, network characteristics such as latency, jitter, and packet loss, protocol-specific parameters, application-specific information, etc.). In some embodiments, the traffic profiler includes a traffic data translator that translates these disparate formats into a common format for the learning engine (e.g., using JSON, XML, YAML, etc.). In addition, the translator of some embodiments removes personally identifiable information as needed.

Typically, especially when the application deployment in the second network is at or near full traffic load, not all of the traffic can be reported to the traffic profiler. That is, only a subset of the traffic is typically sampled, though additional data such as connections/requests per second can also provide information about the traffic observed at the second network. The traffic profiler (e.g., the learning engine of the traffic profiler) also uses various techniques to approximate the lost data, thereby recovering a fuller picture of the observed traffic at the second network.

Using this data, the learning engine applies various machine learning techniques to the data in order to determine the traffic patterns at the second network. Specifically, in some embodiments, the learning engine reduces the dimensionality of the data (e.g., using a principal component analysis algorithm based on singular value decomposition), then applies a clustering algorithm to determine profiles of traffic over time. Based on these clustering patterns, the traffic profiler generates the test traffic configuration for the test traffic generation system to use.

The preceding Summary is intended to serve as a brief introduction to some embodiments of the invention. It is not meant to be an introduction or overview of all inventive subject matter disclosed in this document. The Detailed Description that follows and the Drawings that are referred to in the Detailed Description will further describe the embodiments described in the Summary as well as other embodiments. Accordingly, to understand all the embodiments described by this document, a full review of the Summary, Detailed Description, and Drawings is needed. Moreover, the claimed subject matters are not to be limited by the illustrative details in the Summary, Detailed Description, and Drawings, but rather are to be defined by the appended claims, because the claimed subject matters can be embodied in other specific forms without departing from the spirit of the subject matters.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features of the invention are set forth in the appended claims. However, for purposes of explanation, several embodiments of the invention are set forth in the following figures.

FIG. 1 conceptually illustrates a test traffic generation system that generates a test traffic configuration based on traffic patterns observed in a production network and generates traffic based on these traffic patterns in order to test another system.

FIG. 2 conceptually illustrates the architecture of a traffic profiler of some embodiments.

FIG. 3 conceptually illustrates a process of some embodiments for generating a test traffic configuration.

FIG. 4 conceptually illustrates an electronic system with which some embodiments of the invention are implemented.

DETAILED DESCRIPTION

In the following detailed description of the invention, numerous details, examples, and embodiments of the invention are set forth and described. However, it will be clear and apparent to one skilled in the art that the invention is not limited to the embodiments set forth and that the invention may be practiced without some of the specific details and examples discussed.

Some embodiments provide a method for generating a test traffic configuration for testing a first network (e.g., a test network hosting a new release or a testing deployment for an application) based on replicating traffic patterns of a second network (e.g., a network hosting a real-world production deployment of the application). Specifically, some embodiments receive a set of data streams representing data traffic observed in the second network, use a learning engine to analyze these data streams to determine the traffic patterns of the second network, and generate the test traffic configuration to replicate these determined traffic patterns. The test traffic configuration is provided to a test traffic generation system that tests the first network by generating data traffic replicating that observed in the second network.

FIG. 1 conceptually illustrates a test traffic generation system 100 that generates a test traffic configuration based on traffic patterns observed in a production network 105 and generates traffic based on these traffic patterns in order to test a system 110. In this example, the test traffic generation system 100 executes in a testing network 115 along with the system 110 being tested. However, it should be understood that in different embodiments the test traffic generation system 100, or at least part of that system, could be located in a different network.

As shown, the test traffic generation system 100 includes a traffic profiler 120, a test traffic controller 125, and a set of test traffic sources 130. The traffic profiler 120 of some embodiments, described in further detail below by reference to FIGS. 2 and 3, uses a machine learning engine to replicate the traffic patterns observed in the production network 105 based on data sent from that production network. The traffic profiler generates a test traffic configuration for the testing system 100 and provides this configuration to the test traffic controller 125.

The test traffic controller 125 configures the various test traffic sources 130 to generate test traffic according to the test traffic configuration, thereby replicating the real-world conditions observed in the production network 105. In some embodiments, the test traffic controller provides an interface or other mechanism (e.g., a REST application programming interface (API), command line interface (CLI), graphical user interface (GUI), etc.) for configuration management and traffic monitoring. This interface is accessible to users (e.g., a network administrator, an application developer testing an application, etc.) as well as to the traffic profiler 120 in some embodiments, enabling such a user to configure, monitor, and manage the configuration of test traffic sources as well as the operation of the traffic profiler 120 and its data source(s) in the production network 105. In some embodiments, the test traffic controller operates as an endpoint for aggregated metrics and statistics reporting, enabling a user to monitor the performance of the system 110 being tested.
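For illustration, the following minimal sketch shows how a traffic profiler or administrator might push a test traffic configuration to the controller through such a REST API. The endpoint path and payload schema here are hypothetical, not part of any particular controller implementation.

```python
import requests

# Hypothetical controller endpoint and payload schema (illustration only).
CONTROLLER_URL = "https://test-traffic-controller.example.com/api/v1/traffic-config"

test_traffic_config = {
    "sources": 8,                          # number of test traffic sources to run
    "ramp": {"up_s": 60, "down_s": 30},    # ramp-up/ramp-down periods
    "phases": [                            # time-series traffic pattern
        {"start_s": 0, "requests_per_s": 500, "protocol": "https"},
        {"start_s": 300, "requests_per_s": 2000, "protocol": "https"},
    ],
}

# Push the configuration; the controller then configures the test traffic sources.
resp = requests.post(CONTROLLER_URL, json=test_traffic_config, timeout=10)
resp.raise_for_status()
print("configuration accepted:", resp.json())
```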

The test traffic sources 130 generate and send test traffic to the system 110 being tested, thereby enabling the system 110 to be tested with a simulation of real-world traffic. In some embodiments, the entire life cycle of the test traffic sources 130 is managed by the test traffic controller 125; that is, the test traffic controller 125 configures the test traffic sources 130 to stop and start as per the test traffic configuration and can change the test traffic configuration for each of the test traffic sources as needed. The number of test traffic sources 130 can vary based on the test traffic configuration, with the test traffic controller 125 responsible for dynamically scaling the number of test traffic sources 130 operating at any given time (e.g., if needed to achieve higher throughput or numbers of connections/requests per second). In addition to generating traffic that is sent to the system 110 being tested, in some embodiments the test traffic sources 130 send real-time telemetry collected across various target applications to the test traffic controller 125, in order for the controller 125 to analyze these metrics.

The test traffic sources 130, in some embodiments, can generate and send traffic with parameters specified for many protocols across the entire network stack from layer 2 to layer 7. This includes parameters for various layer 4 protocols such as TCP and UDP, parameters for various layer 7 protocols such as HTTP/HTTPS 1.x and 2.0, and security parameters for SSL/TLS. In addition, these test traffic sources 130 can emulate a real-world browser with persistence validation using cookies and/or SSL session identifiers. If needed, the test traffic sources 130 of some embodiments are able to emulate distributed denial of service (DDoS) traffic in order to test application firewalls.
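As an illustration of the kinds of parameters involved, the following sketch shows a possible per-source traffic specification. The field names are hypothetical; they merely exemplify knobs across layers 4 through 7 rather than a specific product schema.

```python
# Hypothetical per-source traffic specification (illustration only).
source_spec = {
    "l4": {"protocol": "tcp", "connections_per_s": 100, "keepalive": True},
    "l7": {
        "protocol": "http2",
        "requests_per_connection": 10,
        "urls": ["/", "/api/items", "/login"],
    },
    "tls": {"version": "1.3", "sni": "app.example.com", "session_reuse": True},
    # Browser emulation with persistence validation via cookies or SSL session IDs.
    "persistence": {"mode": "cookie", "cookie_name": "JSESSIONID"},
}
```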

In some embodiments, the test traffic controller 125 and the test traffic sources 130 are implemented within containers or virtual machines executing on host computers in the testing network 115. The test traffic controller 125 can be implemented by a single container or distributed across multiple containers (e.g., as a cluster of controllers). The test traffic sources 130 are each implemented on a separate container in some embodiments. These separate containers may execute on the same host computer or on various different host computers in the testing network 115; in some embodiments, multiple test traffic sources 130 might be implemented on containers that execute in the same virtual machine (VM). In addition, the test traffic controller 125 might be implemented on a container that executes on the same host computer or even the same VM as one or multiple test traffic sources 130.

The test traffic sources 130 send traffic directed to the system 110 being tested. In some embodiments, an application (e.g., a distributed application) that receives and handles traffic from external clients is deployed in the testing network 115. The application being tested might be a new release of an existing application, a currently released application that is experiencing issues in the field, a new application being prepared for deployment, etc. In some embodiments, the system 110 includes a gateway device 135, which may operate as a gateway router, load balancer, firewall, etc., depending on the requirements of the application and the network at which the application will be deployed.

Located behind this gateway device 135 are a set of servers 140 on which the application being tested is deployed. These servers 140 represent the various containers, virtual machines, etc. on which various application components are deployed. For instance, many application deployments entail multiple microservices executing on separate containers that combine to implement an application. Though not shown in the figure, these services may be arranged in tiers (e.g., a web tier, application tier, and database tier, as is common for a web-accessible application). In some such embodiments, only one of the tiers (e.g., the web tier) is directly reached through the gateway device 135 by the traffic sent from the test traffic sources 130, while the other tiers are located behind the web tier.

FIG. 1 also illustrates a production network 105. As shown, this network also includes a gateway device 145 and a set of servers 150 on which the deployed application executes. As with the system 110 being tested, this gateway device 145 may operate as a gateway router, load balancer, firewall, etc., depending on the requirements of the application and the production network 105. In addition, the gateway device 145 is configured in some embodiments to monitor traffic received from the public network 160 (and/or sent to the public network 160) and export traffic data summarizing this received traffic. It should be noted that while this example shows an application that receives traffic from client devices 155 through a public network 160, in other examples the application might only interact with clients on private networks (e.g., operating in the same datacenter as the application, or operating in a different datacenter and connected to the application via VPN).

The application operating in the production network 105 is an application currently operating and receiving real-world traffic (e.g., from client devices 155 via a public network 160). In some embodiments, the application deployed in the production network 105 is implemented in the same manner as the application under test. That is, the application under test is designed to match the implementation of the application operating in the production network 105 in some embodiments (e.g., the same microservices operating in the same number of containers and/or VMs, the same tiers, etc.). In other embodiments, the application under test may differ from the production network application (e.g., if the application under test is a new release of the application that differs in certain ways in its implementation).

The operation of the test traffic generation system 100 will now be described. As shown by the encircled 1, the gateway device 145 (or another device at the production network that monitors traffic) exports observed traffic data to the traffic profiler 120. It should be noted that while shown as a single device 145, in some embodiments the application might be deployed in such a way that multiple gateway, load balancer, firewall, etc. devices receive application traffic from the public network 160. For instance, the application might be a large deployment in a single network that requires multiple active gateways or load balancers, or might even be deployed across multiple datacenters for redundancy.

In some embodiments, different types of load balancer or gateway devices provide this traffic data to the traffic profiler in different formats, and if the application deployment requires multiple gateway devices, these devices might use different formats for the export of traffic data. The traffic data arrives in some embodiments as data streams in various formats such as sFlow, IPFIX, PCAP files, NetFlow, etc. The traffic data of some embodiments can include different types of data, such as requests per second, connections per second, cookies, source and destination addresses for specific packets/connections, etc. In some embodiments, the traffic profiler 120 includes a traffic data translator that translates these disparate formats into a common format and removes personally identifiable information as needed. The traffic profiler 120 includes a learning engine as well, in some embodiments, which analyzes the normalized traffic data to generate a test traffic configuration that matches the traffic patterns observed in the received traffic data. As mentioned, the operations of the traffic profiler are described in greater detail below by reference to FIGS. 2 and 3.

As shown by the encircled 2, the traffic profiler 120 provides this generated test traffic configuration to the test traffic controller 125 (e.g., via the REST API of the controller 125). The test traffic controller 125 receives the test traffic configuration and uses it to configure the test traffic sources 130, as shown by the encircled 3. Because the test traffic configuration is based on machine learning analysis of the actual data traffic, the test traffic sources 130 generate and send to the system 110 being tested traffic that replicates real-world conditions, so as to better test the application deployed in the testing network 115.

As mentioned, FIG. 2 conceptually illustrates the architecture of a traffic profiler 200 of some embodiments in greater detail. As shown, the traffic profiler 200 includes a traffic data translator 205 and a learning engine 210. The traffic profiler 200 receives traffic data streams 215 from one or more devices (e.g., gateways, load balancers, firewalls, etc.) that monitor real-world traffic from a production network and outputs a test traffic configuration 220 generated to match the traffic patterns observed in the production network.

The traffic data translator 205 translates the received traffic data streams 215 into normalized traffic data 225, also referred to as traffic data tokens. The traffic data streams 215 can be received from various different types of gateway devices and in various different formats, while the learning engine 210 is designed to process data in a particular format. Thus, the traffic data translator 205 of some embodiments includes various different data handlers for the different possible traffic data formats. In some embodiments, the data handlers are user-defined interfaces for receiving and translating specific formats of data. In some embodiments, the traffic data translator 205 is extensible such that additional data handlers can be defined for new traffic data formats, so that the traffic profiler can generate traffic profiles from any type of traffic data stream.
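A minimal sketch of such an extensible translator follows, assuming each handler converts one input format into a list of normalized traffic data tokens. The registry pattern, handler names, and token fields are illustrative, not a prescribed design.

```python
from typing import Callable, Dict, List

# Each handler turns raw data in one format into normalized tokens (dicts).
Handler = Callable[[bytes], List[dict]]

_HANDLERS: Dict[str, Handler] = {}

def register_handler(fmt: str):
    """Register a user-defined data handler for a traffic data format."""
    def decorator(fn: Handler) -> Handler:
        _HANDLERS[fmt] = fn
        return fn
    return decorator

@register_handler("ipfix")
def handle_ipfix(raw: bytes) -> List[dict]:
    # Parse IPFIX records and emit normalized tokens (parsing elided).
    return [{"format": "ipfix", "requests_per_s": 0}]

def translate(fmt: str, raw: bytes) -> List[dict]:
    # New formats are supported by registering additional handlers.
    return _HANDLERS[fmt](raw)  # raises KeyError for unknown formats
```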

These data handlers might receive PCAP files, gRPC calls, message queues, XML/JSON/Protobuf streams, syslogs, NetFlow records, IPFIX data, etc., with different environments providing different types of data. Different formats of traffic data streams 215 may include different types of data, including requests per second, connections per second, cookies, source and destination addresses for specific packets/connections, etc.

The traffic data translator 205 provides the normalized traffic data 225 to the learning engine 210, which processes the normalized traffic data 225 to generate a test traffic configuration 220 that is intended to replicate real-world traffic patterns. As described above, the test traffic configuration 220 is used by the test traffic controller to configure test traffic sources to generate traffic. In some embodiments, the traffic profiler 200 can also be configured to generate a test traffic configuration for other testing systems (e.g., other traffic generators).

In some embodiments, the learning engine 210 performs a number of successive operations to generate the test traffic configuration based on real-world traffic patterns. Typically, especially when the deployed application in the production network is receiving large amounts of traffic (the sort of condition that should be replicated accurately in the test traffic patterns), not all of the traffic can be reported to the traffic profiler. That is, only a subset of the traffic is typically sampled by the gateway device (or the monitoring device that monitors traffic at the gateway device), though additional data such as connections/requests per second can also provide information about the traffic observed at the production network. The learning engine 210 uses various techniques to approximate the lost data, thereby recovering a fuller picture of the observed traffic at the production network. Using this data, the learning engine 210 applies various machine learning techniques to the data in order to determine the traffic patterns seen at the production network. Specifically, in some embodiments, the learning engine 210 reduces the dimensionality of the data (e.g., using a principal component analysis algorithm based on singular value decomposition), then applies a clustering algorithm to determine clusters of traffic over time. Based on these clustering patterns, the learning engine 210 can generate the test traffic configuration for the test traffic generation system to use.

FIG. 3 conceptually illustrates a process 300 of some embodiments for generating a test traffic configuration. In some embodiments, the process 300 is performed by a traffic profiler (e.g., the traffic profiler 200), primarily by the learning engine. This process uses traffic data streams from one or more real-world sources to generate a test traffic configuration that replicates the traffic patterns seen in a production network.

As shown, the process 300 begins by receiving (at 305) one or more data streams from one or more production networks. As described above, these data streams might be received from various different types of gateway devices and in various different formats (e.g., PCAP files, gRPC calls, message queues, XML/JSON/Protobuf streams, syslogs, NetFlow records, IPFIX data, etc.). For a particular application deployment, the traffic profiler will often receive only a single format of traffic data, unless multiple different types of gateway and/or monitoring devices are used in the production network. However, the same traffic profiler might be used to generate test traffic configurations for a variety of different testing setups, and therefore could be configured to handle various different data stream formats.

Next, the process 300 normalizes (at 310) the format of the received data streams. In some embodiments, the traffic profiler (e.g., a traffic data translator module) includes numerous data handlers for translating the various different types of data that could be received into a normalized format recognized by the learning engine. This normalized format can be represented in JSON, XML, YAML, or another such format for specifying normalized data segments (also referred to as “tokens”).

In some embodiments, the normalized data segments include information specific to packets or data flows, such as the specific protocols used for each layer. As examples, the specific header fields present in a packet or flow (and/or the values of these fields) could be specified, including the Ethernet fields present, some or all of the IP fields present, the TCP and/or UDP fields present, the HTTP(S) fields present, etc. In addition, some embodiments specify cookies, protocol versions, L7 data such as URLs, packet length, etc. The normalized data segments can also include network characteristics (e.g., latency, jitter, packet loss, etc.) if that information is received by the traffic profiler in a data stream. In addition, application-specific data (e.g., payload size, patterns, amounts of data in different directions, etc.) can also be included in these data segments.
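The following sketch shows what one such normalized token might look like in JSON form. The field names are illustrative; the actual contents depend on what the source data stream provides.

```python
import json

# A hypothetical normalized data segment ("token") covering per-layer
# protocol details, L7 data, and network characteristics.
token = {
    "l3": {"protocol": "ipv4", "src": "203.0.113.10", "dst": "198.51.100.5"},
    "l4": {"protocol": "tcp", "dport": 443},
    "l7": {"protocol": "http2", "url": "/api/items", "cookies": 2},
    "metrics": {"requests_per_s": 350, "latency_ms": 12.5, "packet_loss": 0.001},
}
print(json.dumps(token, indent=2))
```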

The learning engine performs various operations on the normalized data to generate a test traffic configuration that matches the traffic patterns observed in the real-world production network setting. It should be noted that, while the process 300 is illustrated as a linear process, this is a conceptual process. In some embodiments, the traffic profiler regularly receives traffic data streams and normalizes this data. At periodic intervals (e.g., based on regular time intervals, receipt of a threshold amount of traffic data, etc.), the learning engine performs its operations to determine patterns in the traffic data. This periodic analysis by the learning engine helps to minimize computation and memory overhead in the traffic profiler.

As shown, the process 300 analyzes (at 315) the received normalized data to fill in traffic data not included in the streams. Especially when the production network deployment of an application is at or near its full traffic load, the data stream(s) from the gateway device(s) at that network will not include information about every packet received for the application. Instead, many traffic monitoring devices sample the data packets received at the device and report traffic data based on these samples, which should be representative of the entirety of the traffic seen at the device. The learning engine of some embodiments tries to recover the missing information by, e.g., using multivariate solver techniques that treat the information in the data stream as a matrix. Some such embodiments transform this matrix into row-echelon form and solve for the unknown data. If the equations are consistent, this technique yields approximated optimal values for the missing data. Some embodiments use Cholesky decomposition and convex optimization techniques for this purpose.

As a simplistic example, data from a gateway device might include information about three types of traffic. The first type includes m cookies and n headers and has a size of a bytes, with a relative ratio in the data streams of A (compared to the other types of traffic). The second type includes o cookies and p headers and has a size of b bytes, with a relative ratio in the data streams of B. The third type includes q cookies and r headers and has a size of c bytes, with a relative ratio in the data streams of C. From this, the traffic profiler can conclude that xA + yB + zC = (requests per second) * (time interval) and that xaA + ybB + zcC = (throughput) * (time interval). Solving these equations enables the traffic profiler to determine the amount of each type of traffic received during the time interval, thereby approximating the missing traffic data.
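A worked version of this example follows, with hypothetical numbers. Because the example as stated provides two equations in three unknowns, the sketch uses a least-squares solve; an actual implementation, as noted above, might instead use row-echelon reduction, Cholesky decomposition, or convex optimization over a richer set of constraints.

```python
import numpy as np

# Hypothetical values: A, B, C are the relative ratios of the three traffic
# types; a, b, c their sizes in bytes; x, y, z the unknown per-type request
# counts over the time interval.
A, B, C = 0.5, 0.3, 0.2
a, b, c = 800, 1500, 400                # bytes per request
rps, throughput, T = 1000, 1.0e6, 60    # requests/s, bytes/s, interval in s

M = np.array([[A, B, C],
              [a * A, b * B, c * C]], dtype=float)
rhs = np.array([rps * T, throughput * T], dtype=float)

# Two equations in three unknowns: the system is underdetermined, so take the
# least-squares (minimum-norm) solution. In practice, further measured
# statistics (e.g., cookie or header totals) would add rows to M.
x, y, z = np.linalg.lstsq(M, rhs, rcond=None)[0]
print(f"approximated counts: x={x:.0f}, y={y:.0f}, z={z:.0f}")
```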

Next, the process reduces (at 320) the dimensionality of the traffic data. The normalized traffic data includes information across a number of dimensions, and the learning engine of some embodiments determines the optimal number of dimensions to which this information can be reduced and performs this dimension-reduction operation. In some embodiments, the learning engine uses principal component analysis (PCA) to reduce the dimensionality of the traffic data by determining which information in the traffic data is important and which information is noise.

Specifically, some embodiments use a PCA function that takes as input the target dimensionality for the traffic data and a solver algorithm. To determine the target dimensionality, some embodiments use maximum likelihood estimation (MLE). For the solver algorithm, some embodiments use a singular value decomposition (SVD) implementation. Reducing the dimensionality of the traffic data using such a method helps to minimize computational and memory overhead without a loss of information, and can even improve the learning engine output by reducing an overfitting effect resulting from having too many dimensions.
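A minimal sketch of this step using the scikit-learn PCA implementation follows, with placeholder data standing in for the normalized traffic matrix; the source does not name a specific library.

```python
import numpy as np
from sklearn.decomposition import PCA

# Rows = samples, columns = dimensions (headers, cookies, requests per
# second, etc.). Placeholder data for illustration.
rng = np.random.default_rng(0)
traffic_matrix = rng.normal(size=(500, 10))

# Target dimensionality chosen by MLE; full SVD as the solver.
# (n_components='mle' requires n_samples >= n_features and a full SVD.)
pca = PCA(n_components="mle", svd_solver="full")
reduced = pca.fit_transform(traffic_matrix)
print("reduced from", traffic_matrix.shape[1], "to", reduced.shape[1], "dimensions")
```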

As an example, the traffic data before dimensionality reduction could include n dimensions such as headers, cookies, HTTP version, TLS version, query parameters, URL, packet body length, requests per second, connections per second, and throughput (and potentially different, fewer, or more parameters). Rather than using each of these as a separate dimension for learning, some embodiments define a principal dimension that is used to avoid overfitting. In some embodiments, the principal dimension does not correspond to a single definable parameter (such as headers, cookies, etc.), but rather to a combination of these parameters.

In some such embodiments, this combination is determined by identifying related parameters in the traffic data, associating entropy factors with the dimensions, and deriving a single parameter representing these related parameters. As an example, a sample set of traffic data might include seven dimensions a, b, c, d, e, g, and h, with parameters a, b, and c related (and therefore candidates for reduction). The dimension-reduction operation associates entropy factors e1, e2, and e3 with parameters a, b, and c, respectively. A new dimension m is defined such that m = f(e1*a, e2*b, e3*c). Thus, the new set of traffic data has five dimensions m, d, e, g, and h (and this set can be further reduced).
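The following sketch illustrates one plausible realization of this derivation, estimating entropy factors from histograms of each parameter and taking f to be a weighted sum; the source leaves both the entropy estimator and f abstract.

```python
import numpy as np
from scipy.stats import entropy

def entropy_factor(values: np.ndarray, bins: int = 16) -> float:
    # Shannon entropy of the parameter's empirical distribution.
    hist, _ = np.histogram(values, bins=bins)
    return float(entropy(hist + 1e-12))

# Placeholder related parameters a, b, c (1000 samples each).
rng = np.random.default_rng(1)
a, b, c = rng.normal(size=(3, 1000))
e1, e2, e3 = (entropy_factor(v) for v in (a, b, c))

# m = f(e1*a, e2*b, e3*c), with f taken here as a weighted sum.
m = e1 * a + e2 * b + e3 * c
print("entropy factors:", round(e1, 3), round(e2, 3), round(e3, 3))
```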

The process then performs (at 325) clustering analysis to determine the traffic pattern observed at the production network(s). Because the traffic data has a high volume and is often repetitive across time, this data can be clustered together for a concise representation of the observed traffic. Some embodiments use density-based spatial clustering of applications with noise (DBSCAN), a clustering algorithm that is resilient to outlier data and is applied here to time-series traffic data. The DBSCAN algorithm takes as parameters (i) the number of data points required in the neighborhood of a particular data point for that data point to be considered a core data point and (ii) the maximum distance between two data points for one data point to be considered in the neighborhood of the other. Based on the values of these parameters, the clustering algorithm of some embodiments finds all pairs of neighboring data points within the set maximum distance of each other in order to identify the core data points with the requisite number of neighbors. For each core data point, the algorithm creates a new cluster if that core data point is not already assigned to a cluster (multiple data points in the same cluster might satisfy the requirements to be core data points). The clustering algorithm then, for each core data point, recursively finds all of the densely connected data points and assigns them to the same cluster as the core data point. This algorithm of some embodiments iterates through the data points until all data points have been evaluated.
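A minimal sketch of this clustering step using the scikit-learn DBSCAN implementation follows; the eps value is a placeholder whose selection is sketched after the next paragraph.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# `reduced` stands in for the dimension-reduced traffic data from the PCA
# step. eps and min_samples correspond to the two DBSCAN parameters above:
# the neighborhood radius and the neighbor count required for a core point.
rng = np.random.default_rng(2)
reduced = rng.normal(size=(500, 5))     # placeholder reduced data

clustering = DBSCAN(eps=0.8, min_samples=6).fit(reduced)
labels = clustering.labels_             # -1 marks outlier (noise) points
print("clusters found:", len(set(labels) - {-1}))
```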

To determine the optimal value of the maximum distance between two data points in the above clustering algorithm (i.e., the size of each data point's neighborhood), some embodiments use a nearest-neighbors search algorithm. This algorithm calculates, for each data point, the distance to its nearest X points and sorts the data points based on this calculation. The point at which the change in the distance between data points is largest is used as the optimal value of the maximum neighborhood distance in the clustering algorithm. Some embodiments use the number of dimensions of the reduced-dimension traffic data (or the number of dimensions plus one) as the value of X in the nearest-neighbors algorithm.
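The following sketch shows this neighborhood-size selection, computing each point's distance to its X-th nearest neighbor and taking the largest jump in the sorted distances; the library choice is illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)
reduced = rng.normal(size=(500, 5))     # placeholder reduced data
X = reduced.shape[1] + 1                # number of dimensions plus one

nn = NearestNeighbors(n_neighbors=X).fit(reduced)
distances, _ = nn.kneighbors(reduced)
k_dist = np.sort(distances[:, -1])      # each point's distance to its X-th neighbor

# The largest change between consecutive sorted distances marks the elbow,
# used as the maximum neighborhood distance (eps) for DBSCAN.
elbow = np.argmax(np.diff(k_dist))
eps = k_dist[elbow]
print("selected eps:", round(float(eps), 3))
```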

Based on the clustering analysis, the process 300 generates (at 330) a test traffic configuration for the test traffic generation system to use in the test network. In some embodiments, the periodic analysis performed by the learning engine results in a single cluster pattern being split across multiple analysis timeframes. Some embodiments compare clusters across multiple consecutive timeframes, with traffic clusters that span these timeframes stitched together. In some embodiments, the test traffic configuration is output as a time-series traffic representation. This configuration includes information specifying when to start, stop, and update traffic patterns learned from the clustering algorithm.
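A minimal sketch of this stitching step follows. The timeframe representation and output schema are hypothetical; they illustrate merging a cluster profile that spans consecutive timeframes into a single phase of the time-series configuration.

```python
def build_config(timeframes):
    """timeframes: list of (start_s, end_s, cluster_profile_id) tuples,
    one per analysis timeframe, in chronological order."""
    phases = []
    for start, end, profile in timeframes:
        last = phases[-1] if phases else None
        if last and last["profile"] == profile and last["stop_s"] == start:
            last["stop_s"] = end    # stitch a cluster spanning adjacent timeframes
        else:
            phases.append({"start_s": start, "stop_s": end, "profile": profile})
    return {"phases": phases}

config = build_config([(0, 300, "p1"), (300, 600, "p1"), (600, 900, "p2")])
print(config)   # two phases: p1 from 0-600 s, p2 from 600-900 s
```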

FIG. 4 conceptually illustrates an electronic system 400 with which some embodiments of the invention are implemented. The electronic system 400 may be a computer (e.g., a desktop computer, personal computer, tablet computer, server computer, mainframe, blade computer, etc.), phone, PDA, or any other sort of electronic device. Such an electronic system includes various types of computer readable media and interfaces for various other types of computer readable media. Electronic system 400 includes a bus 405, processing unit(s) 410, a system memory 425, a read-only memory 430, a permanent storage device 435, input devices 440, and output devices 445.

The bus 405 collectively represents all system, peripheral, and chipset buses that communicatively connect the numerous internal devices of the electronic system 400. For instance, the bus 405 communicatively connects the processing unit(s) 410 with the read-only memory 430, the system memory 425, and the permanent storage device 435.

From these various memory units, the processing unit(s) 410 retrieve instructions to execute and data to process in order to execute the processes of the invention. The processing unit(s) may be a single processor or a multi-core processor in different embodiments.

The read-only memory (ROM) 430 stores static data and instructions that are needed by the processing unit(s) 410 and other modules of the electronic system. The permanent storage device 435, on the other hand, is a read-and-write memory device. This device is a non-volatile memory unit that stores instructions and data even when the electronic system 400 is off. Some embodiments of the invention use a mass-storage device (such as a magnetic or optical disk and its corresponding disk drive) as the permanent storage device 435.

Other embodiments use a removable storage device (such as a floppy disk, flash drive, etc.) as the permanent storage device. Like the permanent storage device 435, the system memory 425 is a read-and-write memory device. However, unlike storage device 435, the system memory is a volatile read-and-write memory, such as random-access memory. The system memory stores some of the instructions and data that the processor needs at runtime. In some embodiments, the invention's processes are stored in the system memory 425, the permanent storage device 435, and/or the read-only memory 430. From these various memory units, the processing unit(s) 410 retrieve instructions to execute and data to process in order to execute the processes of some embodiments.

The bus 405 also connects to the input and output devices 440 and 445. The input devices enable the user to communicate information and select commands to the electronic system. The input devices 440 include alphanumeric keyboards and pointing devices (also called “cursor control devices”). The output devices 445 display images generated by the electronic system. The output devices include printers and display devices, such as cathode ray tubes (CRT) or liquid crystal displays (LCD). Some embodiments include devices such as a touchscreen that function as both input and output devices.

Finally, as shown in FIG. 4, bus 405 also couples electronic system 400 to a network 465 through a network adapter (not shown). In this manner, the computer can be a part of a network of computers (such as a local area network (“LAN”), a wide area network (“WAN”), or an Intranet), or a network of networks, such as the Internet. Any or all components of electronic system 400 may be used in conjunction with the invention.

Some embodiments include electronic components, such as microprocessors, storage, and memory, that store computer program instructions in a machine-readable or computer-readable medium (alternatively referred to as computer-readable storage media, machine-readable media, or machine-readable storage media). Some examples of such computer-readable media include RAM, ROM, read-only compact discs (CD-ROM), recordable compact discs (CD-R), rewritable compact discs (CD-RW), read-only digital versatile discs (e.g., DVD-ROM, dual-layer DVD-ROM), a variety of recordable/rewritable DVDs (e.g., DVD-RAM, DVD-RW, DVD+RW, etc.), flash memory (e.g., SD cards, mini-SD cards, micro-SD cards, etc.), magnetic and/or solid state hard drives, read-only and recordable Blu-Ray® discs, ultra-density optical discs, any other optical or magnetic media, and floppy disks. The computer-readable media may store a computer program that is executable by at least one processing unit and includes sets of instructions for performing various operations. Examples of computer programs or computer code include machine code, such as is produced by a compiler, and files including higher-level code that are executed by a computer, an electronic component, or a microprocessor using an interpreter.

While the above discussion primarily refers to microprocessors or multi-core processors that execute software, some embodiments are performed by one or more integrated circuits, such as application specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs). In some embodiments, such integrated circuits execute instructions that are stored on the circuit itself.

As used in this specification, the terms “computer”, “server”, “processor”, and “memory” all refer to electronic or other technological devices. These terms exclude people or groups of people. For the purposes of the specification, the terms “display” or “displaying” mean displaying on an electronic device. As used in this specification, the terms “computer readable medium,” “computer readable media,” and “machine readable medium” are entirely restricted to tangible, physical objects that store information in a form that is readable by a computer. These terms exclude any wireless signals, wired download signals, and any other ephemeral signals.

This specification refers throughout to computational and network environments that include virtual machines (VMs). However, virtual machines are merely one example of data compute nodes (DCNs) or data compute end nodes, also referred to as addressable nodes. DCNs may include non-virtualized physical hosts, virtual machines, containers that run on top of a host operating system without the need for a hypervisor or separate operating system, and hypervisor kernel network interface modules.

VMs, in some embodiments, operate with their own guest operating systems on a host using resources of the host virtualized by virtualization software (e.g., a hypervisor, virtual machine monitor, etc.). The tenant (i.e., the owner of the VM) can choose which applications to operate on top of the guest operating system. Some containers, on the other hand, are constructs that run on top of a host operating system without the need for a hypervisor or separate guest operating system. In some embodiments, the host operating system uses name spaces to isolate the containers from each other and therefore provides operating-system level segregation of the different groups of applications that operate within different containers. This segregation is akin to the VM segregation that is offered in hypervisor-virtualized environments that virtualize system hardware, and thus can be viewed as a form of virtualization that isolates different groups of applications that operate in different containers. Such containers are more lightweight than VMs.

A hypervisor kernel network interface module, in some embodiments, is a non-VM DCN that includes a network stack with a hypervisor kernel network interface and receive/transmit threads. One example of a hypervisor kernel network interface module is the vmknic module that is part of the ESXi™ hypervisor of VMware, Inc.

It should be understood that while the specification refers to VMs, the examples given could be any type of DCN, including physical hosts, VMs, non-VM containers, and hypervisor kernel network interface modules. In fact, the example networks could include combinations of different types of DCNs in some embodiments.

While the invention has been described with reference to numerous specific details, one of ordinary skill in the art will recognize that the invention can be embodied in other specific forms without departing from the spirit of the invention. In addition, a number of the figures (including FIG. 3) conceptually illustrate processes. The specific operations of these processes may not be performed in the exact order shown and described. The specific operations may not be performed in one continuous series of operations, and different specific operations may be performed in different embodiments. Furthermore, the process could be implemented using several sub-processes, or as part of a larger macro process. Thus, one of ordinary skill in the art would understand that the invention is not to be limited by the foregoing illustrative details, but rather is to be defined by the appended claims.

1-20. (canceled)
21. A method for generating a test traffic configuration for testing a first application in a first network prior to deploying the first application, the method comprising: from a first gateway device operating in a second network, receiving a set of data streams representing data traffic that is (i) sent to a second application deployed in the second network and (ii) observed at the first gateway device entering the second network from an external network; using a machine learning engine to analyze the set of data streams in order to determine traffic patterns for the data traffic entering the second network from the external network; and based on the determined traffic patterns, generating a test traffic configuration for testing a system comprising a second gateway device and the first application in the first network prior to deployment of the first application.
22. The method of claim 21, wherein: the second network is a production network and the data traffic represented by the set of data streams is real-world data traffic received by the first gateway device from client devices accessing the second application via a public network; and the first network is a test network and the first application is a new release of the second application.
23. The method of claim 21, wherein: the second network is a production network and the data traffic represented by the set of data streams is real-world data traffic received by the first gateway device from client devices accessing the second application via a public network; and the first network is a test network and the first application is another instance of the second application.
24. The method of claim 21 further comprising providing the generated test traffic configuration to a testing controller that uses the test traffic configuration to configure a set of test traffic sources.
25. The method of claim 24, wherein the test traffic sources are network endpoints deployed in the first network that generate and send data traffic to the second gateway device based on the generated test traffic configuration.
26. The method of claim 21, wherein: the set of data streams comprises (i) a first data stream having a first format and (ii) a second data stream having a second format; and the method further comprises normalizing the first and second data streams into a same standardized format for consumption by the machine learning engine.
27. The method of claim 21, wherein: each of the data streams represents the observed data traffic using a respective plurality of dimensions; and using the machine learning engine to analyze the set of data streams comprises: determining an optimal number of dimensions for the data streams; and reducing dimensionality of the data streams to the determined optimal number of dimensions for further analysis by the machine learning engine.
28. The method of claim 21, wherein using the machine learning engine to analyze the set of data streams comprises determining, based on traffic statistics during a time interval during which the first gateway device observed the represented data traffic, characteristics of additional data traffic processed by the first gateway but not represented in the received set of data streams.
29. The method of claim 28, wherein the determined traffic patterns are based on the represented data traffic and the additional data traffic determined by the machine learning engine.
30. The method of claim 21, wherein using the machine learning engine to analyze the set of data streams comprises using a clustering algorithm to determine the traffic patterns for the data traffic entering the second network from the external network.
31. A non-transitory machine-readable medium storing a program which when executed by at least one processing unit generates a test traffic configuration for testing a first application in a first network prior to deploying the first application, the program comprising sets of instructions for: receiving, from a first gateway device operating in a second network, a set of data streams representing data traffic that is (i) sent to a second application deployed in the second network and (ii) observed at the first gateway device entering the second network from an external network; using a machine learning engine to analyze the set of data streams in order to determine traffic patterns for the data traffic entering the second network from the external network; and generating, based on the determined traffic patterns, a test traffic configuration for testing a system comprising a second gateway device and the first application in the first network prior to deployment of the first application.
32. The non-transitory machine-readable medium of claim 31, wherein: the second network is a production network and the data traffic represented by the set of data streams is real-world data traffic received by the first gateway device from client devices accessing the second application via a public network; and the first network is a test network and the first application is a new release of the second application.
33. The non-transitory machine-readable medium of claim 31, wherein: the second network is a production network and the data traffic represented by the set of data streams is real-world data traffic received by the first gateway device from client devices accessing the second application via a public network; and the first network is a test network and the first application is another instance of the second application.
34. The non-transitory machine-readable medium of claim 31, wherein the program further comprises a set of instructions for providing the generated test traffic configuration to a testing controller that uses the test traffic configuration to configure a set of test traffic sources.
35. The non-transitory machine-readable medium of claim 34, wherein the test traffic sources are network endpoints deployed in the first network that generate and send data traffic to the second gateway device based on the generated test traffic configuration.
36. The non-transitory machine-readable medium of claim 31, wherein: the set of data streams comprises (i) a first data stream having a first format and (ii) a second data stream having a second format; and the program further comprises a set of instructions for normalizing the first and second data streams into a same standardized format for consumption by the machine learning engine.
37. The non-transitory machine-readable medium of claim 31, wherein: each of the data streams represents the observed data traffic using a respective plurality of dimensions; and the set of instructions for using the machine learning engine to analyze the set of data streams comprises sets of instructions for: determining an optimal number of dimensions for the data streams; and reducing dimensionality of the data streams to the determined optimal number of dimensions for further analysis by the machine learning engine.
38. The non-transitory machine-readable medium of claim 31, wherein the set of instructions for using the machine learning engine to analyze the set of data streams comprises a set of instructions for determining, based on traffic statistics during a time interval during which the first gateway device observed the represented data traffic, characteristics of additional data traffic processed by the first gateway but not represented in the received set of data streams.
39. The non-transitory machine-readable medium of claim 38, wherein the determined traffic patterns are based on the represented data traffic and the additional data traffic determined by the machine learning engine.
40. The non-transitory machine-readable medium of claim 31, wherein the set of instructions for using the machine learning engine to analyze the set of data streams comprises a set of instructions for using a clustering algorithm to determine the traffic patterns for the data traffic entering the second network from the external network.